Validity and reliability of International Classification of Diseases-10 codes for all forms of injury: A systematic review

Background. Intentional and unintentional injuries are a leading cause of death and disability globally. International Classification of Diseases (ICD), Tenth Revision (ICD-10) codes are used to classify injuries in administrative health data and are widely used for health care planning and delivery, research, and policy. However, a systematic review of their overall validity and reliability has not yet been done.
Objective. To conduct a systematic review of the validity and reliability of external cause injury ICD-10 codes.
Methods. MEDLINE, EMBASE, COCHRANE, and SCOPUS were searched (inception to April 2023) for validity and/or reliability studies of ICD-10 external cause injury codes in all countries for all ages. We examined all available data for external cause injuries and injuries related to specific body regions. Validity was defined by sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). Reliability was defined by inter-rater reliability (IRR), measured by Krippendorff’s alpha, Cohen’s kappa, and/or Fleiss’ kappa.
Results. Twenty-seven published studies from 2006 to 2023 were included. Across all injuries, the mean outcome values and ranges were sensitivity: 61.6% (35.5%-96.0%), specificity: 91.6% (85.8%-100%), PPV: 74.9% (58.6%-96.5%), NPV: 80.2% (44.6%-94.4%), Cohen’s kappa: 0.672 (0.480-0.928), Krippendorff’s alpha: 0.453, and Fleiss’ kappa: 0.630. Poisoning and hand and wrist injuries had higher mean sensitivity (84.4% and 96.0%, respectively), while self-harm and spinal cord injuries were lower (35.5% and 36.4%, respectively). Transport and pedestrian injuries and hand and wrist injuries had high PPVs (96.5% and 92.0%, respectively). Specificity and NPV were generally high, except for abuse (NPV 44.6%).
Conclusions and significance. The validity and reliability of ICD-10 external cause injury codes vary based on the injury types coded and the outcomes examined, and overall, they only perform moderately well. Future work, potentially utilizing artificial intelligence, may improve the validity and reliability of ICD codes used to document injuries.


Background
Injuries are a major global health burden: worldwide deaths due to all injuries increased from 4,260,493 (uncertainty interval: 4,085,700 to 4,396,138) in 1990 to 4,484,722 (4,332,010 to 4,585,554) in 2017 [1]. Furthermore, all-injury incidence (i.e., new cases) increased from 354,064,302 (338,174,876 to 371,610,802) in 1990 to 520,710,288 (493,430,247 to 547,988,635) in 2017 [1]. Accurate reporting of injuries is therefore critical so that healthcare providers, government officials, and policy makers are informed about injury rates and about which injury types are most prevalent. This information shows where public health or other healthcare actions may help to prevent injuries and to treat them better. Since International Classification of Diseases, Tenth Revision (ICD-10) codes are one of the primary sources of information for reporting diagnoses and are commonly used in research, analysis of their accuracy is especially important.
ICD codes are used worldwide in all areas of healthcare as a coding system to report diagnoses. In addition to diagnostic reporting, they may be used for billing, claims processing, medical care review, data classification, and healthcare statistics reporting [2]. ICD codes are the most widely used classification system for hospital records, and approximately 70% of global health expenditure is distributed according to their data [3,4]. Therefore, accurate reporting of these codes is essential for maintaining high-quality healthcare data worldwide.
The 10th revision of ICD codes was developed by the World Health Organization (WHO) and is currently used worldwide [3,4]. These codes have been in effect since approximately 2000, though this varies by country. A primary use of the ICD-10 codes is for injury data surveillance and research, for which hospital-managed case records are a main source. The injury ICD-10 codes include codes for the external causes of injury conditions (the circumstances and other characteristics of events that led to injury conditions) and for the primary injury outcomes themselves.
Despite their wide use in healthcare, the overall validity and reliability of the ICD-10 codes for external-cause injuries have yet to be examined in a systematic review. Individual studies have reported their validity and reliability for different types of injuries, but an overall analysis of how accurately the ICD-10 codes identify the correct conditions has not been reported for these outcomes. Thus, there is a gap in the literature on whether the ICD-10 codes reported in medical records for external cause injuries accurately describe the patients' diagnoses (i.e., the codes' validity), and whether they are coded consistently (i.e., the codes' reliability).
Studies examining the accuracy of external cause of injury ICD-9 codes (E-codes; within the ICD, Ninth Revision, Clinical Modification (ICD-9-CM)) found that broad external cause code blocks in ICD-9-CM-coded data may be used with some confidence, while caution should be exercised for very specific code blocks [5,6]. Nevertheless, ICD-10 external cause codes are very different from ICD-9-CM codes, as ICD-10 codes have more specificity and a different structure across code blocks [6].
Our study aims to investigate the validity and reliability of ICD-10 codes for external-cause injuries, reporting the overall accuracy of these codes in identifying the correct diagnoses (i.e., validity) and whether reporting is reproducible amongst the individuals coding them (i.e., reliability). We conducted a systematic review of studies reporting on the validity and/or reliability of ICD-10 codes for classifying patients with intentional and unintentional external injuries, including all ages and all countries.

Literature search
An extensive search was conducted in Ovid MEDLINE, EMBASE, COCHRANE, and Scopus, from all dates available (1966-2023, 1947-2023, 1996-2023, 1996-2023, respectively). The searches covered database inception to the search date and were conducted on the following dates: Ovid MEDLINE (April 16, 2023), Cochrane Library (April 18, 2023), EMBASE (April 18, 2023), and Scopus (April 19, 2023). The searches run in each database are available as supplementary materials (S1-S4 Texts). Two reviewers (SP and NO) independently screened the studies. Any disagreements were discussed with a third reviewer (MC). Article screening was completed using Covidence software. In addition, the authors conducted a supplementary search by manually checking the reference lists of all relevant articles.
A protocol for this study was published on the International Platform of Registered Systematic Review and Meta-analysis Protocols (https://doi.org/10.37766/inplasy2023.8.0022, [7]). Our review was completed in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) framework, and a completed checklist is provided as supplementary information (S1 Checklist).

Inclusion criteria
Studies that examined validity and/or reliability for the specified ICD-10 injury codes were included in the analysis. All included studies had to be peer-reviewed primary articles, published in English, examining humans, with full text available. All ages and countries were included, as the ICD-10 codes we investigated are primarily uniform across countries. In studies where only some of the codes examined were ICD-10 injury codes, the relevant results were extracted if they were reported as separate outcome values in the paper.
Population. The population examined included patients of all ages from any country who experienced an external injury. The ICD-10 codes used in the inclusion criteria to classify external injuries are summarized in Table 1. These include resulting injury codes and external cause of injury codes. Only cases that examined and recorded these injuries with the specified ICD-10 injury codes were included in the analysis.
The ICD-10 codes we selected for our analysis to categorize and present injury data were based on the reliable standards reported by the Association of Public Health Epidemiologists in Ontario (APHEO) [8] and Parachute's 2022 guidelines for ICD-10 code classifications used to document injury causes [9]. The codes included are primarily based on the ICD-10-CA codes, as these are applicable to classify injuries in all countries [10]. These injury codes overlap across all countries, with a few minor discrepancies which are described in the results section. We divided the available results into ICD-10 code categories for external causes of injuries (i.e., self-harm injuries, abuse, transport and pedestrian injuries, and poisoning) and injuries to body regions (i.e., hand and wrist injuries, brain injuries, spinal cord injuries, lower extremities injuries, and multiple (total body) injury types reported).
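As an illustration of how the block ranges in Table 1 can be applied to individual records, the following Python sketch maps a code to its Table 1 block. It is not the procedure used in this review; the function name `table_1_block` and the tuple layout are hypothetical, and only the rows of Table 1 that list a code range are included.

```python
import re
from typing import Optional

# Block ranges taken from Table 1: (letter, first number, last number, label).
# Only rows that list a code range are included; this is an illustrative
# lookup, not the review's own tooling.
TABLE_1_BLOCKS = [
    ("S", 0, 9, "Injuries to the head"),
    ("S", 10, 19, "Injuries to the neck"),
    ("S", 20, 29, "Injuries to the thorax"),
    ("S", 30, 39, "Injuries to the abdomen, lower back, lumbar spine, and pelvis"),
    ("S", 40, 49, "Injuries to the shoulder and upper arm"),
    ("S", 50, 59, "Injuries to the elbow and forearm"),
    ("S", 60, 69, "Injuries to the wrist and hand"),
    ("S", 70, 79, "Injuries to the hip and thigh"),
    ("S", 80, 89, "Injuries to the knee and lower leg"),
    ("S", 90, 99, "Injuries to the ankle and foot"),
    ("T", 0, 7, "Injuries involving multiple body regions"),
    ("T", 20, 32, "Burns and corrosions"),
    ("T", 36, 50, "Poisoning by drugs, medicaments, and biological substances"),
]

# An ICD-10 category starts with one letter followed by two digits (e.g. S62).
_CATEGORY = re.compile(r"^([A-Z])(\d{2})")


def table_1_block(code: str) -> Optional[str]:
    """Return the Table 1 label for a code such as 'S62.3', or None if unlisted."""
    match = _CATEGORY.match(code.strip().upper())
    if match is None:
        return None
    letter, number = match.group(1), int(match.group(2))
    for block_letter, low, high, label in TABLE_1_BLOCKS:
        if letter == block_letter and low <= number <= high:
            return label
    return None
```

For example, `table_1_block("S62.3")` falls in the S60-S69 block, while a code outside the listed ranges (such as an external cause code beginning with X) returns `None` in this sketch.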

Injury    ICD-10 codes
Injuries to the head    S00-S09
Injuries to the neck    S10-S19
Injuries to the thorax    S20-S29
Injuries to the abdomen, lower back, lumbar spine, and pelvis    S30-S39
Injuries to the shoulder and upper arm    S40-S49
Injuries to the elbow and forearm    S50-S59
Injuries to the wrist and hand    S60-S69
Injuries to the hip and thigh    S70-S79
Injuries to the knee and lower leg    S80-S89
Injuries to the ankle and foot    S90-S99
Injuries involving multiple body regions    T00-T07
Injuries to unspecified parts of trunk, limb, or body region
Burns and corrosions    T20-T32
Poisoning by drugs, medicaments, and biological substances    T36-T50

Intervention. The intervention evaluated in this review was the validity and/or reliability reported for the specified ICD-10 injury codes.
Comparator. Studies were included if they compared the reported ICD-10 injury codes to chart review and/or physician diagnosis as the gold standard (for validity measures) and/or compared ICD-10 injury codes between coders or other healthcare workers (i.e., inter-rater reliability (IRR) for reliability measures).
Outcomes. The outcome measures included in the analysis to assess the validity and reliability of external injury ICD-10 codes were: (1) sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) for validity, and (2) IRR, measured by Krippendorff's alpha, Cohen's kappa, and/or Fleiss' kappa, for reliability.
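For readers less familiar with the four validity measures, the sketch below shows their standard definitions in Python, computed from a 2x2 comparison of ICD-10 coding against the gold standard. The class and function names are illustrative assumptions and the counts in the usage note are invented, not data from any included study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """2x2 counts of ICD-10 coding versus the gold standard (hypothetical names)."""
    tp: int  # injury coded and confirmed by chart review / physician diagnosis
    fp: int  # injury coded but not confirmed
    fn: int  # injury confirmed but not coded
    tn: int  # injury neither coded nor confirmed


def _ratio(numerator: int, denominator: int) -> float:
    # A measure is undefined when its denominator is zero.
    return numerator / denominator if denominator else float("nan")


def validity_measures(c: ConfusionCounts) -> dict:
    """Return sensitivity, specificity, PPV, and NPV as proportions in [0, 1]."""
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),  # confirmed cases that were coded
        "specificity": _ratio(c.tn, c.tn + c.fp),  # non-cases that were not coded
        "ppv": _ratio(c.tp, c.tp + c.fp),          # coded cases that were confirmed
        "npv": _ratio(c.tn, c.tn + c.fn),          # uncoded records that were true non-cases
    }
```

With invented counts such as `ConfusionCounts(tp=40, fp=10, fn=60, tn=890)`, sensitivity is 40/100 while specificity is 890/900, which illustrates how a code can miss many true cases and still rarely flag non-cases.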

Data extraction
Two reviewers (SP and NO) independently reviewed the full-text articles using Covidence software, and any discrepancies were discussed after independent review. A third reviewer (MC) was consulted for further discussion if necessary. Zotero software was used for extracting the articles once consensus was reached. The PICO (Population, Intervention, Comparator, Outcomes) inclusion framework was applied to all screening and full-text review, via a comprehensive checklist in Excel, to ensure consistency amongst reviewers. This framework is commonly used in systematic reviews in healthcare to ensure high-quality literature review and results reporting [11]. Thus, papers were screened for the population being injured patients (defined by the external injury codes listed in Table 1), the intervention being an analysis of ICD-10 codes, the comparator being physician diagnosis and/or chart review, against which the recorded ICD-10 codes were evaluated, and the outcomes being validity (measured as sensitivity, specificity, PPV, and NPV) and reliability (measured as Krippendorff's alpha, Cohen's kappa, and/or Fleiss' kappa). Only the articles and statistics that met all inclusion criteria were extracted from the papers screened to calculate and report the final summary values.

Quality assessment
All studies included in our analysis were assessed for risk of bias to investigate study quality using an adaptation of the Quality Assessment of Diagnostic Accuracy Studies (QUADAS) tool [12]. Factors such as the study design, patient population, and comparison to the chosen gold standard may all affect the results of the studies included in our paper. Thus, we used the QUADAS protocol to analyze each study and report these findings so that they can be considered when reviewing our results. Furthermore, this method of quality assessment has previously been used in diagnostic accuracy analyses of ICD codes [13,14].
Each reviewer independently answered the QUADAS questions to assess the quality of all the included full-text studies for these areas of bias. Then, each study was classified as having a high, moderate, or low risk of bias based on a qualitative assessment. This classification is consistent with previous studies that used QUADAS to examine ICD codes' diagnostic accuracy [13,14]. The QUADAS framework used to analyze the studies, based on previously published analyses, is summarized in S5 Text. Our risk of bias assessment omitted one of the 14 questions from the QUADAS tool (and was thus evaluated out of 13 questions), as it was not applicable to this type of quality assessment. This is consistent with previous ICD diagnostic accuracy studies [12-14].

Statistical analysis
The outcome values, including sensitivity, specificity, PPV, and NPV for validity, and Krippendorff's alpha, Cohen's kappa, and Fleiss' kappa for reliability, were extracted from all papers and used to calculate summary values. The ICD-10 injury codes from the inclusion criteria were separated into 9 main injury-based categories by grouping similar injury outcomes.
Ranges and mean values were calculated and reported for each of the outcomes in all the injury categories to provide an overall estimate of the validity and/or reliability of the ICD-10 codes for those injuries. Means were compared amongst injury categories and totaled for overall estimates of validity and reliability.
Our calculations averaged each individual study's outcomes, so all studies were weighted equally. This was done to minimize bias by preventing some studies from being weighted more heavily simply because they examined the codes multiple times in different ways. However, when discussing the studies' bias/quality, all values were reported without averaging, to show the full spectrum of values reported without adjustments. Since sample size was not explicitly reported for all studies (e.g., those where injury patients were a portion of the ICD-10 codes reported and only total sample size was provided), this element was not used for weighting in our statistical analysis.
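The effect of weighting studies equally can be illustrated with a short Python sketch. It represents one possible reading of the averaging step (average within each study first, then across studies) and is not the authors' actual calculation; the function name, the choice to take the range over per-study means, and the example values are all assumptions.

```python
from collections import defaultdict
from statistics import mean


def summarize_equal_weight(values):
    """Summarize (study_id, value) pairs so that each study counts once.

    Values are first averaged within each study, then the per-study means
    are averaged. The range is taken over the per-study means, which is an
    assumption made for this sketch.
    """
    by_study = defaultdict(list)
    for study_id, value in values:
        by_study[study_id].append(value)
    study_means = [mean(study_values) for study_values in by_study.values()]
    return {
        "n_studies": len(study_means),
        "mean": mean(study_means),
        "min": min(study_means),
        "max": max(study_means),
    }


# Invented example: study A reports two sensitivity values, study B one.
example = [("A", 90.0), ("A", 80.0), ("B", 50.0)]
summary = summarize_equal_weight(example)
pooled_mean = mean(value for _, value in example)  # weights study A twice
```

In this invented example the equal-weight mean is 67.5, whereas pooling all three values gives about 73.3, because study A contributes two values to the pooled figure.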

Literature search
We identified 910 records through our original searches (from database inception to April 2023) of the MEDLINE, EMBASE, Cochrane Library, and Scopus databases. Of these, 309 were identified as duplicates, which left 601 articles for title and abstract screening. The search selection process was completed in accordance with the PRISMA framework and is summarized in Fig 1 [15]. The full-text reports of 27 articles were sought, but three were excluded due to lack of full-text availability (n = 2) or the article being published in French (n = 1). The remaining 24 were assessed for eligibility, of which four were excluded due to not using ICD-10 codes (n = 1), not using chart review/physician diagnosis as a gold standard for evaluating validity (n = 2), or not calculating outcome measures that exclusively correspond to injuries (n = 1), leaving 20 articles. We also identified 479 records from citation searches. From this search, nine articles' full-text reports were assessed for eligibility, with two excluded for not reporting injuries. Thus, a total of 33 articles were assessed for eligibility, of which six were excluded, leaving 27 articles included in this systematic review of external cause of injury codes.
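The selection counts above can be checked with simple arithmetic. The following Python sketch restates the figures from this paragraph and verifies that they are mutually consistent; the function and key names are illustrative only.

```python
def prisma_counts() -> dict:
    """Recompute the study-selection counts reported in the text."""
    # Database searches
    records_identified = 910
    duplicates = 309
    screened = records_identified - duplicates            # title/abstract screening

    reports_sought = 27
    not_retrieved = 2 + 1                                 # no full text (2), French (1)
    db_assessed = reports_sought - not_retrieved
    db_excluded = 1 + 2 + 1                               # not ICD-10, no gold standard, not injury-only
    db_included = db_assessed - db_excluded

    # Citation searches
    citation_assessed = 9
    citation_excluded = 2                                 # did not report injuries
    citation_included = citation_assessed - citation_excluded

    return {
        "screened": screened,
        "db_assessed": db_assessed,
        "db_included": db_included,
        "total_assessed": db_assessed + citation_assessed,
        "total_excluded": db_excluded + citation_excluded,
        "total_included": db_included + citation_included,
    }


counts = prisma_counts()
```

Running this reproduces the totals stated in the text: 601 records screened, 33 reports assessed for eligibility, 6 excluded, and 27 included.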

Study characteristics
Demographic variables. Of the 27 articles that were included in the final review, 13 (48%) were from the United States of America (U.S.A.), six (22%) were from Canada, four (15%) were from Australia, two (7%) were from Taiwan, one (4%) was from Iran, and one (4%) was from Norway. Characteristics of all included studies are presented in Table 2. Sample sizes of the injury patients' codes varied widely between the studies and the codes, ranging from the tens in some studies to the thousands in others (S1 Table). The records that were reviewed cover a 38-year period (1982 to 2020), with two (7%) articles that analyzed records between 1982 and 2000, 10 (37%) articles that analyzed records between 2001 and 2010, and 20 (74%) articles that analyzed records between 2011 and 2020.
Gold standard. Chart review was used as the gold standard in 22 articles, and direct physician diagnosis based on patient evaluation was used in two articles. The three remaining articles did not use a gold standard, as they only evaluated the IRR of their respective ICD-10 codes of focus.

Quality assessment
The quality of the included studies was evaluated using the QUADAS tool [12]. Of the 27 studies, 22 (81%) were categorized as high quality, and the remaining 5 (19%) as medium quality (Fig 2). A detailed breakdown of the quality assessment for each study is provided in S2 Table.

Injury categories
Nine main injury categories were used to report the outcomes of interest, based on the relevant literature within our inclusion criteria. These comprise external causes of injuries (self-harm injuries, abuse, transport and pedestrian injuries, and poisoning) and resulting bodily injuries categorized by body part (hand and wrist injuries, brain injuries, spinal cord injuries, lower extremities injuries, and multiple (total body) injury types reported, i.e., injury/trauma codes reported as groupings of multiple injury types). A detailed breakdown of all the relevant codes that were included and examined from all studies is listed in S3 Table.

Statistical outcomes and data analysis
All relevant results and the summary calculations for ranges and mean value per injury category for each outcome are summarized in S4 Table.
Sensitivity and specificity. Sixteen studies examined sensitivity, with 46 outcome values reported, while 12 studies examined specificity, with 33 outcome values reported for the ICD-10 codes being examined. Across the 9 injury categories, the mean sensitivity was 61.6% (range 35.5%-96.0%), while the mean specificity was 91.6% (range 85.8%-100%). These values are summarized in Fig 3.
PPV and NPV. In the context of this study, positive predictive values assess the ratio of true positive cases to the total number of cases identified by the ICD-10 codes as having the condition. Negative predictive values assess the ratio of true negative cases to the total number of cases identified by the ICD-10 codes as not having the condition. Twenty-three studies examined positive predictive values, with 61 outcome values reported, while 9 studies examined negative predictive values, with 20 outcome values reported for the ICD-10 codes of interest. Across the 9 injury categories, the mean positive predictive value was 74.9% (range 58.6%-96.5%), while the mean negative predictive value was 80.2% (range 44.6%-94.4%). The values for each injury category are summarized in Fig 4.
Inter-rater reliability. Inter-rater reliability (IRR) was evaluated using 3 measures: Krippendorff's alpha, Cohen's kappa, and Fleiss' kappa. Nine studies examined Cohen's kappa, resulting in 16 reported outcome values. One study also examined Krippendorff's alpha, with 1 outcome value reported. Another study examined reliability using Fleiss' kappa, reporting 1 outcome value. Across the 9 injury categories, the mean Cohen's kappa value was 0.672 (range 0.480-0.928). With limited data for Krippendorff's alpha and Fleiss' kappa, the values reported were 0.453 and 0.630, respectively.
Injury category statistical analysis. The mean and range of each injury outcome are reported in S4 Table.
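Since Cohen's kappa is the reliability measure reported most often among the included studies, a minimal Python sketch of its standard two-rater definition may be useful. The function name and the example code assignments are hypothetical; the included studies may have used statistical packages and settings that differ from this sketch.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning one label per record.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from each rater's label frequencies.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must label the same non-empty set of records.")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Invented example: two coders assign an ICD-10 category to ten charts.
coder_1 = ["S62"] * 5 + ["S52"] * 5
coder_2 = ["S62", "S62", "S62", "S62", "S52", "S52", "S52", "S52", "S52", "S62"]
kappa = cohens_kappa(coder_1, coder_2)
```

In this invented example the coders agree on 8 of 10 charts, chance agreement is 0.5, and kappa is therefore about 0.6, which is noticeably lower than the raw 80% agreement.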

External causes of injuries outcomes
Transport and pedestrian injuries. Two studies assessed transport and pedestrian injuries, resulting in 8 relevant outcome values reported (S1 Table). These studies covered bicycle injuries, pedestrian injuries, femur fractures, and transport incident injuries. Sensitivity was examined by both articles and ranged from 33.1%-95.7% (mean 73.8%). Specificity and positive predictive values were examined by one article each, with values of 100% and 96.5%, respectively. Cohen's kappa ranged from 0.905-0.945 (mean 0.928). One of the two studies that reported injury mechanism codes for transport and pedestrian injuries was rated as high quality, while the other was rated as medium quality. There was little difference in the inter-rater reliability scores between the studies: inter-rater reliability ranged from 0.91 to 0.98 in the medium-quality study, and from 0.88 to 0.97 in the high-quality study. The difference in sensitivity values was less clear: they ranged from 87% to 98% in the medium-quality study, but in the high-quality study they ranged from 25.0% to 45.0% for one half of the values and from 90.2% to 98.3% for the other half.
Abuse. The overarching category of abuse covered a range of topics, including child physical abuse, assault, sexual abuse, and other forms of maltreatment. Three studies examined injuries resulting from abuse, with 11 relevant outcome values reported (S1 Table). Because each study had a different focus area, together they cover this injury type broadly. Sensitivity and specificity ranged from 31.8%-72.6% (mean 55.6%) and 84.6%-90.8% (mean 88.2%), respectively. Positive predictive values ranged from 71.1%-76.0% (mean 73.5%). The assessment of negative predictive values was limited to a single study, resulting in a value of only 44.6%. Similarly, Cohen's kappa and Fleiss' kappa were 0.818 and 0.63, respectively. All the studies that reported injury codes for abuse were rated as high quality.
Poisoning. Poisoning is a broad category; the included studies covered toxic effects of carbon monoxide and poisoning from drugs and biological substances. Poisoning by opioids, other synthetic narcotics, and psychodysleptics (both intentional and unintentional) was also included. Four studies examined injuries caused by poisonings, with 14 relevant outcome values reported (S1 Table). Sensitivity and positive predictive values ranged from 79.5%-89.3% (mean 84.41%) and 32.8%-84.0% (mean 64.3%), respectively. Specificity and negative predictive values were only assessed by one article, with values of 85.8% and 92.1%, respectively. Cohen's kappa yielded a value of 0.735. Three of the four studies that reported injury mechanism codes for poisoning (unintentional) were rated as high quality, while the other was rated as medium quality. There was a difference in the sensitivity values, which ranged from 76% to 83% in the medium-quality study, and from 81.2% to 94.9% among the high-quality studies. The difference in PPVs was less clear: they ranged from 67% to 71% in the medium-quality study, but among the high-quality studies they ranged from 32.8% to 60.3% for about half of the values and from 76.6% to 97.9% for the other half (with PPV ≥ 76.6% in two of the three high-quality studies reporting on PPV).

Injuries classified by body parts outcomes
Neurological/spinal cord injuries. This category covers fractures and nerve injuries, as well as spinal cord injuries. Additionally, it examines outcomes related to injuries of the brain and spinal cord, encompassing concussion, edema, and nerve injuries. Three studies assessed spinal cord injuries, resulting in 47 relevant outcome values reported (S1 Table). Sensitivity and specificity ranged from 0.9%-89.8% (mean 36.4%) and 6.7%-100% (mean 86.3%), respectively. Positive predictive values and negative predictive values ranged from 30.0%-100% (mean 81.8%) and 10.0%-93.0% (mean 60.8%), respectively. Cohen's kappa ranged from 0.56-0.70 (mean 0.65). Two of the three studies that reported injury outcome codes for spinal cord injuries (unintentional) were rated as high quality, while the other was rated as medium quality. There was little difference in the PPVs, which ranged from 76.2% to 100.0% in the medium-quality study (with the exception of an outlier: 33.3%), and from 76.0% to 97.0% (with the exception of an outlier: 30.0%) among the high-quality studies. There was a difference in the sensitivity values, which ranged from 0.9% to 33.3% in the medium-quality study, and from 50.0% to 89.8% (with the exception of an outlier: 30.0%) among the high-quality studies. The difference in specificity values was less clear: they ranged from 98.8% to 100.0% in the medium-quality study, but among the high-quality studies they ranged from 6.7% to 25.8% for about half of the values and from 97% to 98% for the other half.
Hand and wrist injuries. Hand and wrist injuries encompass open wounds on the wrists and hands, along with fractures, sprains, and strains of joints and ligaments. One study examined hand and wrist injuries, with 2 relevant outcome values reported (S1 Table). Sensitivity and positive predictive value were 96% and 92%, respectively. This study was rated as high quality.
Brain injuries. This injury category covers a range of brain injury outcomes, including skull fractures, concussions, cerebral edema, traumatic brain injuries, hemorrhage, and other intracranial injuries. Additionally, it examines outcomes linked to shaken infant syndrome and unspecified head injuries. Five studies assessed brain injuries, resulting in 13 relevant outcome values reported (S1 Table). Sensitivity and positive predictive values ranged from 6.8%-81% (mean 53.6%) and 33.3%-100% (mean 74.7%), respectively. Specificity and negative predictive values were only assessed by one article, with values of 88.0% and 92.8%, respectively. Four of the five studies that reported injury outcome codes for brain injuries (unintentional) were rated as high quality, while the other was rated as medium quality. There was not a clear difference in the PPVs, which ranged from 22.7% to 73.7% (with about half of the values ≤ 40.8%, and the other half ≥ 60.3%) in the medium-quality study, and from 60.6% to 100.0% (with the exception of an outlier: 33.3%) among the high-quality studies.
Lower extremities injuries. "Lower extremities injuries" is a broad category; the specific injuries included are ankle fractures, hip fractures (including proximal femur fractures), and fractures, sprains, and strains of joints and ligaments at the ankle and foot level. Five studies examined injuries in the lower extremities, resulting in 40 relevant outcome values reported (S1 Table). Sensitivity and specificity ranged from 0%-94.5% (mean 53.7%) and 76.0%-98.2% (mean 90.6%), respectively. Positive predictive values and negative predictive values ranged from 0%-100% (mean 58.6%) and 54.0%-99.0% (mean 89.6%), respectively. Cohen's kappa ranged from 0.26-0.95 (mean 0.54). Krippendorff's alpha was reported in one study and yielded a value of 0.453. Four of the five studies that reported injury outcome codes for lower extremity injuries (unintentional) were rated as high quality, while the other was rated as medium quality. There was a difference in the PPVs, which ranged from 91.0% to 100.0% in the medium-quality study, and from 0.0% to 100.0% (most � 43.0%) among the high-quality studies. There was a difference in the sensitivity values, which ranged from 94.0% to 95.0% in the medium-quality study, and from 0.0% to 96.0% (most � 50.0%) among the high-quality studies. Inter-rater reliability scores ranged from 0.93 to 0.97 in the medium-quality study, and from 0.26 to 0.60 among the high-quality studies.
Multiple (total body) injury types reported. This injury category is broader, but it includes critical injury types from studies that examined multiple injury types in one analysis, which are essential for a complete analysis. It covers external-cause injuries to different body parts, burns, firearm injuries (both accidental and intentional), head and neck injuries, and trauma codes with unspecified details. Four studies examined multiple (total body) injury types, resulting in 17 relevant outcome values reported (S1 Table). Sensitivity and positive predictive values ranged from 5.0%-89.5% (mean 65.8%) and 5.5%-95.5% (mean 69.2%), respectively. Specificity and negative predictive values were only assessed by one article, with values of 98.5% and 94.4%, respectively. Cohen's kappa ranged from 0.15-0.77 (mean 0.557). Two of the four studies that reported injury outcome codes for multiple (total body) injury types (unintentional) were rated as high quality, while the other two were rated as medium quality. There was little difference in the PPVs, which was 92.6% in the medium-quality study, and ranged from 93.3% to 95.5% (with the exception of an outlier: 5.5%) among the high-quality studies. There was little difference in the sensitivity values, which was 76.1% in the medium-quality study, and ranged from 66.3% to 89.5% (with the exception of an outlier: 5.0%) among the high-quality studies. There was, however, a difference in the inter-rater reliability scores, which was 0.15 in the medium-quality study, and ranged from 0.75 to 0.77 among the high-quality studies.

Outcome measures
The values reported varied widely depending on the outcome and the injury category, making it difficult to comment on an overall statistic for ICD-10 injury codes, but some key trends in the data stand out. Our findings provide overall summaries for all types of external injuries reported in the literature, as our systematic review is the first investigation thus far of the overall validity and reliability of ICD-10 external injury codes.
Sensitivity and specificity. Mean sensitivity values were generally lower (mean = 61.6%, range = 35.5%-96.0%), while specificity values were high across the studies (mean = 91.6%, range = 85.8%-100%). Importantly, due to the nature of sensitivity and specificity, when one of these values increases for a diagnostic accuracy test, the other tends to decrease. Thus, a generally high value for both is better, but achieving a very high score on both measures is unlikely. Nevertheless, considering the wide use of ICD-10 codes for injury research, the sensitivity values reported are concerning. Furthermore, no gold-standard "cut-off" values have been widely implemented for what are considered high- or low-quality values for sensitivity and specificity in diagnostic accuracy studies. In other research contexts, such as influenza testing, >90% has been reported as excellent sensitivity/specificity, 80-89% as good, and 60-79% as fair, while <60% is considered poor [42].
PPV and NPV. The overall positive predictive value across the studies was better than the sensitivity values, with a mean of 74.9% (range 58.6%-96.5%), though it was still not very high. The negative predictive value was quite high (mean = 80.2%, range = 44.6%-94.4%). As with sensitivity and specificity, no gold-standard "cut-off" has been established for the PPV and NPV of diagnostic accuracy studies. However, in other contexts (e.g., pediatric screening tools), values have been reported for PPV and NPV as: >90%: excellent, 80-89%: good, 60-79%: fair, and <60%: poor [43].
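To make the borrowed bands concrete, the Python sketch below applies them to the overall means reported in the Results. These bands come from other research contexts [42,43] and are not established thresholds for ICD-10 coding; the function name is hypothetical, and the handling of boundary values (for example, treating exactly 90% as excellent and values between 89% and 90% as good) is an assumption, since the cited bands do not specify it.

```python
def accuracy_band(percent: float) -> str:
    """Map a percentage to the qualitative bands quoted from [42,43].

    Boundary handling is an assumption of this sketch: each band is treated
    as closed at its lower edge (>= 90, >= 80, >= 60).
    """
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must be between 0 and 100")
    if percent >= 90.0:
        return "excellent"
    if percent >= 80.0:
        return "good"
    if percent >= 60.0:
        return "fair"
    return "poor"


# Overall means across the 9 injury categories, as reported in the Results.
REPORTED_MEANS = {"sensitivity": 61.6, "specificity": 91.6, "ppv": 74.9, "npv": 80.2}
bands = {name: accuracy_band(value) for name, value in REPORTED_MEANS.items()}
```

Under these assumptions, mean sensitivity and PPV fall in the fair band, NPV in the good band, and specificity in the excellent band, which matches the qualitative description of the codes performing only moderately well overall.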
The wide range of values reported for sensitivity, specificity, PPV, NPV, and inter-rater reliability may be attributable to the large scope of injuries included in the study, as the training of medical personnel, the common coding practices, and the definitions of codes used to represent injury types vary widely between areas of medicine (e.g., brain injuries compared to self-harm injuries). Causes of discrepancies between the outcomes within injury categories are unclear, though differences in sample sizes and study design elements may be contributing factors (e.g., one researcher versus multiple researchers reviewing the charts). Furthermore, some included studies had a higher risk of bias, as described, which may have affected the results. Nevertheless, the studies included in our systematic review did have a uniform gold standard of chart review/physician diagnosis and met strict inclusion criteria for the codes and outcomes included.
Injury outcomes. Transport and pedestrian injuries, and hand and wrist injuries, had particularly high PPVs (mean = 96.5% and 92.0%, respectively). The rest of the PPVs were moderate to good. The sensitivity values for these categories (transport/pedestrian and hand/wrist injuries) were also quite good (mean = 73.8% and 96.0%, respectively). A common source of misclassification that may have contributed to the good, rather than excellent, sensitivity value is a lack of training on reporting and classifying pedestrian/transport injuries in neurologist training programs [5,46]. Transport and pedestrian injuries also had particularly high IRR outcomes (mean = 0.93), and specificity values were high for all categories, as previously stated. Poisoning codes also had high sensitivity values (mean = 84.4%). One coding discrepancy that may have reduced this value is that patients with other acute conditions (e.g., burns and other substance poisoning) may be mistaken for different types of poisoning [21].
Though most sensitivity values were moderate to low, the values for self-harm injuries and spinal cord injuries were particularly low (mean = 36.4% and 35.5%). One factor that may have contributed to the lower values for self-harm injuries is that a commonly used self-harm code, X84 (intentional self-harm by unspecified means), is more likely to be used in cases of individuals who are Indigenous, those with suicide attempts by cutting, and non-suicidal self-injury in females [32]. Similar bias trends in the reporting of intentional self-harm in different groups, including a bias in the reporting of self-harm cases among young females in hospital data, have been observed in other studies [47]. The particularly low spinal cord injury sensitivity values may be attributable to injury characteristics, such as the severity and the level of the spine trauma, not being accurately reported in the coding [18]. A similar problem was found previously in ICD-9 studies that examined the validity and reliability of spinal injury codes [48,49].
The NPVs for abuse were also quite low (mean = 44.6%) compared to the other injury categories, which all had high or relatively high values. This may be attributable to only 5% of the study population receiving the ICD-10-CM code Z04.72 (examination and observation following alleged physical abuse) [33]. The ICD-10-CM guidelines state that all patients should receive this code, though this is not always the case [50]. Furthermore, abuse presents diagnostic challenges, as a proper history is not always taken and the resulting record may have key omissions, leading to inaccurate diagnoses [36]. Additionally, abuse may be more difficult for physicians to identify than other conditions, such as traffic injuries, as healthcare professional training in abuse is lacking, especially for certain forms such as elder abuse [51]. There may also be reluctance among physicians to diagnose abuse due to uncertainty and discomfort with these diagnoses [52].
The rest of the outcomes were generally moderate, including the IRR values across the injury categories. Of note, there were no results reported for: sensitivity of hand and wrist injuries, NPVs of hand/wrist injuries and transport and pedestrian injuries, and IRR for hand/wrist injuries and brain injuries.
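For readers less familiar with the IRR statistics, the following sketch shows how Cohen's kappa corrects the raw agreement between two coders for the agreement expected by chance. It is a minimal illustration with hypothetical labels for four charts, not data from any included study, and the function name is our own.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning one category per item."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's category frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical external cause categories assigned by two coders to four charts.
coder_1 = ["transport", "transport", "fall", "fall"]
coder_2 = ["transport", "fall", "fall", "fall"]
print(cohens_kappa(coder_1, coder_2))
```

In this toy example the coders agree on three of four charts (75%), but half of that agreement would be expected by chance, so kappa is considerably lower than the raw agreement.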
Causes and solutions for external cause injury coding misclassification. A variety of sources may have contributed to the lower outcome values reported across the injury categories and to the discrepancies amongst coders for the external cause injury ICD-10 codes included in our study as a whole.
Health professionals work under time constraints, which can lead to errors of omission and commission due to inadequate information for coding or unclear documentation [53]. Incomplete medical histories may also contribute to these errors, so training staff on how best to report codes in these cases, as well as emphasizing proper history documentation practices, would be beneficial [36].
Furthermore, inadequate training of hospital staff in coding and the lack of a standardized approach to ICD-10 coding, both compounded by variations in staff experience, may contribute to these errors [5]. Thus, more emphasis on training programs that teach accurate coding practices to hospital staff, including admissions staff, providers, and hospital coders, could substantially reduce common coding misclassifications [5,46].

Comparisons of injury ICD-10 coding to non-injury ICD-10 coding
ICD-10 codes for conditions other than external cause injuries, such as tic disorders and obsessive-compulsive disorder, have been reported to have a PPV of 97% [54]. This high validity may be due to the increased diagnostic precision of these psychiatric diagnoses in the Diagnostic and Statistical Manual of Mental Disorders (DSM), resulting in improved diagnostic accuracy [54]. As suggested by the findings of Ruck and colleagues [54], clearer and more detailed specifications for ICD-10 diagnoses of external cause injuries may reduce coding errors and improve the validity of these codes. This is especially important for conditions that may have a myriad of physical presentations, such as different types of abuse.
Future directions. Although improved physician training programs may have some positive impact on ICD coding practices, as demonstrated by Paydar & Asadi [55], this alone is likely not enough to significantly improve the validity and reliability of the codes. The use of artificial intelligence (AI) for medical record review, such as natural language processing and deep neural networks that analyze patient files and patient-provided information, has been shown to be useful for diagnosis coding [56]. Machine learning algorithms can be used to gather chart data and generate codes for diagnoses [57]. For example, Dewaswala and colleagues [58] reported that natural language processing effectively identified and classified hypertrophic cardiomyopathy patients from narrative text reports of cardiac magnetic resonance imaging, with high performance compared to manual annotation. Since natural language processing can review the documents about each patient to reach a diagnostic category, with more development these methods may become more efficient and accurate than traditional methods of coding [58].
Thus, implementing these AI algorithms for assigning ICD-10 codes could make coding more accurate and reduce the time and energy healthcare staff spend on it. Digital electronic medical records that force clinicians into certain diagnostic categories also hold hope for improving diagnostic and coding accuracy. Some accurate deep learning models have already been created for automatic ICD-10 coding and show promise for the future development of this technology [57]. However, more work is required to integrate these technologies into hospital systems, to train healthcare staff in using them, and to assess the precision of the algorithms' coding before using them regularly. In addition, lower income countries, where the major burden of global injury exists, may not have the capacity to introduce expensive electronic medical records.
Furthermore, an analysis investigating the validity and reliability of ICD-10 codes for individual body regions is important to address in future studies, as these codes are also widely used in research and healthcare contexts.

Risk of bias
The risk of differential quality amongst the studies may also have contributed to the discrepancies in the validity and reliability results. In four of the medium-quality studies [30,31,36,40], it was either unclear or confirmed that the researchers' interpretations of the patient medical charts were not independent of their knowledge of the previously assigned ICD-10 codes, and vice versa (i.e., whether blinding protocols were utilized). The other medium-quality study [26] did not evaluate the validity of the codes and, therefore, did not use chart review in its analysis. Furthermore, one of the medium-quality studies described the execution of chart review too vaguely to permit its replication [31], and another did not explain withdrawals from the study [30]. Finally, another study had an additional child abuse scale that was used for only a portion of the total sample analyzed [36].
Consistencies in quality. Despite these discrepancies, the overall quality of the studies was good. All 27 studies met the criteria for good quality on seven of the 14 questions from the QUADAS tool (Fig 2). The spectra of patients included in all studies were representative of the patients who would receive the test in practice (i.e., injury patients for each particular injury), reducing the overall risk of spectrum bias [12]. The selection criteria of the included studies were clearly described, and the chart review/physician diagnosis was likely to correctly classify injury patients. Furthermore, the whole samples, or a random selection of patients, received verification (i.e., were compared to chart review/physician diagnosis), reducing the overall risk of partial verification bias. The comparison of chart review/physician diagnosis to ICD-10 codes was done independently (blinded) in some studies; however, this was unclear in others.

Key study strengths
Strengths of our systematic review include the inclusion of studies from all countries, which improves generalizability, and well-defined, all-encompassing selection criteria for examining ICD-10 external cause injuries. Our study provides key insights for stakeholders who regularly use ICD-10 codes for research, claims processing, health system administration and planning, and policy.

Limitations
Though our study provides insights into the validity and reliability of ICD-10 external cause injury codes, some limitations exist. There was a large variation in sample size across the studies, which may have introduced some bias in the calculation of results. Furthermore, the inclusion of only English-language studies may have led to selection bias, and one study did not have an English full text available. Also, two studies were done prior to 2010, so limited years of ICD-10 data were available. Additionally, as is common with systematic reviews, the amount of data varied between injury outcomes, depending on how much literature was available for each. Furthermore, some injury outcomes were reported in only one study (e.g., sensitivity for poisoning and NPV for abuse); these are shown in Figs 3-5 without error bars depicting the range. This limited the number of results included in the study for some outcomes, which could have introduced bias in the results.

Conclusion
Injuries are a significant and growing cause of death and disability and have a global economic impact. Our results on the validity and reliability of ICD-10 injury codes indicate that caution needs to be exercised in drawing conclusions when these codes are used for research or policy. While codes such as those for transport and pedestrian injuries and poisoning had good validity and reliability, others, such as those for abuse and self-harm, require improvement. Strategies such as much more standardized diagnostic criteria for ICD-10 codes and more comprehensive coding training are required to improve ICD-10 injury coding accuracy. More widespread use of digital electronic medical records with standardized diagnostic criteria, and the use of artificial intelligence techniques such as natural language processing, hold promise for improving coding accuracy and precision in the future.

Fig 1. Diagram of study selection and review. Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA)-style flowchart of study selection and review. Abbreviations: ICD-10 = International Classification of Diseases, Tenth Revision. https://doi.org/10.1371/journal.pone.0298411.g001

Fig 5 summarizes the mean and range values for the IRR outcomes in each injury category.

Fig 3. Sensitivity and specificity outcomes of external cause injury ICD-10 codes. The mean sensitivities (Panel A) and specificities (Panel B), with error bars reflecting the range of values (where reported), from studies that validated ICD-10 codes for injury mechanisms and outcomes in hospitalization data. a. Sensitivity outcomes for all injury categories. b. Specificity outcomes for all injury categories. https://doi.org/10.1371/journal.pone.0298411.g003

Fig 4. PPVs and NPVs of external cause injury ICD-10 codes. The mean PPVs (Panel A) and NPVs (Panel B), with error bars reflecting the range of values (where reported), from studies that validated ICD-10 codes for injury mechanisms and outcomes in hospitalization data. a. PPV outcomes for all injury categories. b. NPV outcomes for all injury categories. https://doi.org/10.1371/journal.pone.0298411.g004

Fig 5. Inter-rater reliability outcomes for external cause injury ICD-10 codes. The mean inter-rater reliabilities, with error bars reflecting the range of values (where reported), from studies that analyzed the reliability of ICD-10 codes for injury mechanisms and outcomes in hospitalization data. https://doi.org/10.1371/journal.pone.0298411.g005

Table 3. Summary of the validity and reliability outcomes for all high-quality studies included in the analysis.
All values are reported as mean percentage (with ranges) where data was available.